Appendix D — Assignment 4

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code individually, not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points and is due on Friday, 23rd May 2025 at 11:59 pm.

  5. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the .ipynb file instead.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  6. Please make sure your code results are clearly incorporated in your submitted HTML file.

Feel free to add data visualizations of your hyperparameter tuning process. Visualizing and analyzing tuning results is important—even if it’s not explicitly required in the instructions.

D.1 AdaBoost vs Bagging (4 points)

Which model among AdaBoost and Random Forest is more sensitive to outliers? (1 point) Explain your reasoning with the theory you learned on the training process of both models. (3 points)

D.2 Regression with Boosting (54 points)

For this question, you will use the miami_housing.csv file. You can find the description for the variables here.

The SALE_PRC variable is the regression response and the rest of the variables, except PARCELNO, are the predictors.

D.2.1 a): Preprocessing

Read the dataset. Create the training and test sets with a 60%-40% split and random_state = 1. (1 point)
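A minimal sketch of the required split, shown on a tiny placeholder frame standing in for miami_housing.csv (only SALE_PRC and PARCELNO are real column names from the assignment; the predictor column is made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame standing in for miami_housing.csv
df = pd.DataFrame({
    "PARCELNO": range(10),
    "LND_SQFOOT": range(10),          # hypothetical predictor
    "SALE_PRC": [100 + i for i in range(10)],
})

# SALE_PRC is the response; drop it and the PARCELNO identifier from the predictors
X = df.drop(columns=["SALE_PRC", "PARCELNO"])
y = df["SALE_PRC"]

# 60%-40% train-test split with random_state = 1, as required
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)
print(X_train.shape[0], X_test.shape[0])  # 6 4
```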

D.2.2 b) AdaBoost

Tune an AdaBoost Regressor to achieve a test MAE below $47,000.

  • You must set random_state=1 for all components (e.g., base estimator, AdaBoost model, etc.).
  • Submissions that meet the MAE cutoff using any other random_state will receive zero credit.

Scoring (6 points total):

  • 5 points for achieving test MAE < $47,000
  • 1 point for reporting the training MAE of your tuned model to evaluate generalization

D.2.3 c) Loss Functions in Gradient Boosting

Gradient Boosting supports multiple loss functions, including squared_error, absolute_error, and huber.

  • (1 point) Which loss function performs best on this dataset?
  • (3 points) What are the advantages of this loss function compared to the other two?

D.2.4 Task: Tune a Gradient Boosting Model

Your goal is to tune a Gradient Boosting Regressor to achieve a cross-validation MAE below $45,000.

  • You must keep all random_state values set to 1.
  • Submissions using any other random_state will receive zero credit, even if the MAE cutoff is met.

Scoring (10 points total):

  • 5 points for using a well-reasoned hyperparameter search strategy
  • 5 points for achieving MAE < $45,000
  • 1 point for reporting the training MAE of your tuned model to evaluate generalization

Hints

  • Parallel processing is not supported in the vanilla GradientBoostingRegressor.
  • BayesSearchCV, like gradient boosting itself, performs a sequential search—each trial depends on the result of the previous one—so it does not support parallel exploration.
  • Optuna is generally faster and more efficient than both BayesSearchCV and GridSearchCV. It supports parallel execution of trials and includes several built-in performance enhancements.

D.2.5 d) XGBoost vs. Gradient Boosting

XGBoost Enhancements:

  • What improvements make XGBoost superior to vanilla Gradient Boosting in terms of performance and runtime?
    • Explain the enhancements (1 point)
    • Provide the reasons behind the improvements (1 point)
    • Identify relevant hyperparameters and describe how they influence model behavior (2 points)

XGBoost Limitations:

  • What important feature or behavior is missing in XGBoost but well-implemented in vanilla Gradient Boosting? (1 point)

D.2.6 e) Tuning XGBoost with Different Search Strategies

Tune an XGBoost Regressor to achieve a cross-validation MAE below $42,500.

  • You must keep random_state=1 in all components (e.g., XGBoost model, CV splits, search objects).
  • Submissions that meet the cutoff using any other random_state will receive zero credit.

Scoring (10 points total):

  • 5 points for a well-designed and appropriate hyperparameter search strategy
  • 5 points for using 3 different search strategies
  • 5 points for achieving MAE < $42,500

Search Strategies (Required Comparison)

You must tune the model using three different search settings:

  1. BayesSearchCV

    • Unlike the vanilla GradientBoostingRegressor, XGBoost supports parallel training and can benefit from multi-core processing (n_jobs=-1), so BayesSearchCV is practical with it.
  2. Optuna (with n_jobs=-1)

  3. Optuna (default single-threaded)

Execution Time

You must report the execution time for each tuning strategy.

  • You can measure this using:
    • A Jupyter magic command like %%time, or
    • Python’s time.time() (end - start)

For a fair comparison, use the same search space across all methods.
Only one of the tuned models needs to meet the performance cutoff, but you should still report times for all three.
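The timing side of the comparison can be as simple as the sketch below; `run_tuning_stub` is a placeholder for your actual BayesSearchCV and Optuna calls:

```python
import time

def run_tuning_stub(name):
    """Placeholder for one tuning run (BayesSearchCV, Optuna with n_jobs=-1,
    or single-threaded Optuna) over the shared search space."""
    time.sleep(0.01)  # stand-in for real work

timings = {}
for strategy in ["BayesSearchCV", "Optuna (n_jobs=-1)", "Optuna (1 thread)"]:
    start = time.time()
    run_tuning_stub(strategy)
    timings[strategy] = time.time() - start  # seconds elapsed

for name, secs in timings.items():
    print(f"{name}: {secs:.3f} s")
```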

D.2.7 f) Feature Importance

Using the best hyperparameter settings, fit the final model and output the feature importances.

  • Use the .feature_importances_ attribute or equivalent method from your model.
  • Visualize the importances if possible (e.g., with a bar plot).
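A sketch of extracting and plotting importances, using a small model fit on synthetic data (the feature names are placeholders for the real predictor names):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for rendering
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # placeholder names

model = GradientBoostingRegressor(random_state=1).fit(X, y)
importances = model.feature_importances_  # sums to 1 across features

# Sort ascending so the most important feature ends up at the top of the chart
order = np.argsort(importances)
plt.barh(np.array(feature_names)[order], importances[order])
plt.xlabel("Importance")
plt.tight_layout()
plt.savefig("importances.png")
```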

D.3 Imbalanced Classification with Regularized Gradient Boosting (42 points)

In this question, you will use the train.csv and test.csv datasets. Each observation represents a marketing call made by a banking institution. The target variable y indicates whether the client subscribed to a term deposit (1) or not (0), making this a binary classification task.

The predictors you should use are: age, day, month, and education.

⚠️ Note: As discussed last quarter, the variable duration must not be used as a predictor.
No credit will be given for models that include it.

D.3.1 a) Data Preprocessing

Perform the following preprocessing steps:

  • Read in the training and testing datasets.
  • Create a new season feature by mapping each month to its corresponding season.
  • Define the predictor and response variables.
  • Convert all categorical predictors to pandas.Categorical dtype before passing them to the models.
  • Convert the response variable y to binary values (0 and 1).

(5 points)

We will rely on the native categorical feature support provided by each library (XGBoost, LightGBM, and CatBoost), so explicit one-hot encoding is not required.
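The steps above can be sketched as follows, using a tiny placeholder frame in place of train.csv (the month spellings are assumed to be lowercase abbreviations; adjust the mapping to the actual data):

```python
import pandas as pd

# Placeholder frame standing in for train.csv
df = pd.DataFrame({
    "age": [30, 45, 52],
    "day": [5, 12, 20],
    "month": ["jan", "jul", "oct"],
    "education": ["primary", "secondary", "tertiary"],
    "y": ["no", "yes", "no"],
})

# Map each month to its season (assumed spellings)
season_map = {
    "dec": "winter", "jan": "winter", "feb": "winter",
    "mar": "spring", "apr": "spring", "may": "spring",
    "jun": "summer", "jul": "summer", "aug": "summer",
    "sep": "fall",   "oct": "fall",   "nov": "fall",
}
df["season"] = df["month"].map(season_map)

# Predictors and response; convert categoricals for native handling
X = df[["age", "day", "month", "education", "season"]].copy()
for col in ["month", "education", "season"]:
    X[col] = X[col].astype("category")
y = df["y"].map({"no": 0, "yes": 1})
print(X.dtypes)
print(y.tolist())  # [0, 1, 0]
```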

D.3.2 b) Target Exploration

For classification tasks, it’s important to examine the distribution of the target variable to determine whether the classes are imbalanced. This helps you avoid common pitfalls when dealing with imbalanced classification.

  • Explore the class distribution in both the training and test sets.

(2 points)
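A sketch of the check, using a placeholder target in place of the real labels:

```python
import pandas as pd

# Placeholder target standing in for the real training labels
y_train = pd.Series([0] * 90 + [1] * 10)

# Absolute counts and proportions make any imbalance obvious;
# repeat the same check for the test set
print(y_train.value_counts())
print(y_train.value_counts(normalize=True))
```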

D.3.3 c) LightGBM and CatBoost

LightGBM and CatBoost are gradient boosting frameworks, like XGBoost, but each introduces unique innovations.

  • What do LightGBM and CatBoost have in common with XGBoost? (2 points)
  • What advantages do they offer over XGBoost? (2 points)
  • How are these advantages implemented in each model? (3 points)
  • All three libraries support native categorical feature handling.
    Do they use the same approach? If not, explain the differences. (3 points)

D.3.4 d) Handling Imbalanced Classification in Gradient Boosting Extensions

For all extensions of Gradient Boosting (XGBoost, LightGBM, and CatBoost):

  • Are there additional inputs or hyperparameters available to handle imbalanced classification? (1 point)
  • If yes, describe how the method works. (1 point)
  • How should the value of this hyperparameter be set or tuned for best results? (1 point)

D.3.5 e) Model Evaluation: XGBoost, LightGBM, and CatBoost

Evaluate the performance of the following models: XGBoost, LightGBM, and CatBoost, using the metrics below:

  • Recall
  • Precision
  • F1 Score
  • AUPRC (Area Under the Precision-Recall Curve)
  • ROC AUC

For each model, build and compare two versions:

  1. Baseline model: using default settings with random_state=1, without addressing class imbalance.
  2. Imbalance-aware model: with scale_pos_weight enabled to handle class imbalance.
  • Compare the performance of both versions for each model.
  • Summarize which model and approach performed best for imbalanced classification, and try to explain why.
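The five metrics can be computed with scikit-learn regardless of which boosting library produced the predictions. A sketch on placeholder predictions, plus the common negative-to-positive ratio heuristic for scale_pos_weight (a heuristic assumption, not a value prescribed by the assignment):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Placeholder labels and predicted probabilities standing in for any model
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.4, 0.8, 0.6, 0.5, 0.7, 0.2, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 threshold

print("recall   ", recall_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("AUPRC    ", average_precision_score(y_true, y_prob))  # PR-curve area
print("ROC AUC  ", roc_auc_score(y_true, y_prob))

# A common starting point for scale_pos_weight: #negatives / #positives
spw = (y_true == 0).sum() / (y_true == 1).sum()
print("scale_pos_weight ~", round(spw, 2))
```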

D.3.6 f) Tuning LightGBM for Classification

Tune a LightGBM classifier to achieve:

  • Cross-validation accuracy ≥ 70%
  • Cross-validation recall ≥ 65%

You must set random_state=1 in all components (e.g., model, cross-validation, search objects).
Submissions that exceed the cutoffs using any other random_state will receive zero credit.

Scoring (15 points total):

  • 7.5 points for a well-designed and justified search strategy
  • 7.5 points for meeting both performance thresholds

Hints:

  • For classification, you may also tune the decision threshold (not just model hyperparameters).
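Threshold tuning amounts to scanning cutoffs over the predicted probabilities; a sketch on placeholder scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Placeholder labels and cross-validated probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.45, 0.3, 0.7, 0.55, 0.35, 0.6, 0.5, 0.1])

# Scan candidate thresholds; keep one that meets both cutoffs
results = []
for t in [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]:
    pred = (y_prob >= t).astype(int)
    results.append((t, accuracy_score(y_true, pred), recall_score(y_true, pred)))
    print(f"threshold {t:.2f}: accuracy {results[-1][1]:.2f}, "
          f"recall {results[-1][2]:.2f}")
```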

D.3.7 g) Test Set Evaluation

Evaluate the tuned LightGBM model on the test set:

  • Report the test accuracy and test recall.
  • Include the threshold used for classification.

This will help assess how well the model generalizes beyond the training data.

(2 points)

D.3.8 h) Tuning CatBoost for Classification

Tune a CatBoost classifier to achieve:

  • Cross-validation accuracy ≥ 75%
  • Cross-validation recall ≥ 65%

You must set random_state=1 in all components (e.g., model, cross-validation, search objects).
Submissions that exceed the cutoffs using any other random_state will receive zero credit.

Scoring (15 points total):

  • 7.5 points for a well-structured and appropriate hyperparameter search
  • 7.5 points for meeting both performance thresholds

Hints:

  • You are free to use any tuning strategy and define any reasonable search space.
  • In addition to tuning hyperparameters, you may also need to tune the decision threshold to meet the classification performance criteria.

D.3.9 i) Test Set Evaluation

Evaluate the tuned CatBoost model on the test set:

  • Report the test accuracy and test recall.
  • Include the classification threshold used.

This will help assess whether the model generalizes well beyond the training data.
(1 point)

D.4 🎁 Bonus (Extra Credit) – 20 Points

To help you prepare for your upcoming prediction project involving hyperparameter tuning, I’ve created the following optional tasks.
Feel free to skip them if time does not permit.

D.4.1 a) Comparing Tuning Strategies

Compare the tuning time and results of GridSearchCV and RandomizedSearchCV using the same search space you used in Task 2e (BayesSearchCV and Optuna).

  • What are the trade-offs between exhaustive search, random search, and smarter strategies like Bayesian optimization and Optuna?
  • Are the differences in runtime justified by improvements in model performance?
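A minimal timing comparison of the two searches, using a plain decision tree on synthetic data as a stand-in for your Task 2e model and search space:

```python
import time
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Hypothetical shared search space (mirror the one from your Task 2e runs)
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]}

searches = {
    "grid": GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid,
                         cv=3, scoring="neg_mean_absolute_error"),
    "random": RandomizedSearchCV(DecisionTreeRegressor(random_state=1), param_grid,
                                 n_iter=5, cv=3, random_state=1,
                                 scoring="neg_mean_absolute_error"),
}

for name, search in searches.items():
    start = time.time()
    search.fit(X, y)  # exhaustive (12 candidates) vs 5 sampled candidates
    print(f"{name}: best MAE {-search.best_score_:.1f}, "
          f"{time.time() - start:.2f} s")
```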

D.4.2 b) Resumable Tuning Strategies

Do your own research: Among all the tuning strategies you have used, which ones allow you to continue tuning without starting from scratch when increasing n_trials or n_iter?

  • Identify the methods that support incremental or resumable search.
  • Explain how they work and why they are efficient.
  • Provide code to demonstrate how these strategies reuse previous results rather than starting the search from scratch.